64 research outputs found
Data mining, between statistics and artificial intelligence
Over the past decade we have witnessed the emergence of a new concept in the business world: data mining. Some companies have set up data mining units closely linked to company management, and sessions devoted to data mining have taken centre stage at business forums. Data mining is presented as a new discipline, tied to Artificial Intelligence and distinct from Statistics. Meanwhile, in the more academic statistical world, data mining was initially regarded as just another fad, one that appeared after expert systems and had long been known under the name "data fishing".
Is this really so? In this article we address the statistical roots of data mining and the problems it deals with, give an overview of the current scope of data mining, present an example of its application to television audience data and, finally, offer a view of the future.
Description and classification of the Catalan comarques into homogeneous regions according to land use
The theme of this article is the application of exploratory statistical techniques to the study of large numerical tables of spatial statistics. The sheer volume of statistics compiled over a large area, as in a population census, frequently makes it difficult to assimilate all the information they contain. It is shown that these techniques make a thorough understanding of such statistics possible without inspecting the tables themselves. The objectives usually pursued are: (1) to highlight the most salient characteristics of the statistics, such as associations and/or contrasts among the elements under study, an objective easily met through methods of descriptive factorial analysis; (2) to group the basic elements of study into a limited number of representative classes, which can likewise be achieved through a simple algorithm of ascending hierarchical classification. The application of this method demonstrates the compatibility of the two results. This normally corresponds to the final stage in the study of statistical tables in which observations relate to small areas or points. The natural desire to make the classes obtained coincide with geographical regions made it necessary to introduce the content relationship into the algorithm of ascending hierarchical classification. The application undertaken makes it possible to identify improvements in the interpretation of the classes obtained. Postprint (published version)
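As a rough illustration of the last point, the sketch below reads the relationship added to the hierarchical classification as a spatial contiguity constraint. That reading is an assumption, since the abstract does not spell it out. The region names, the single land-use variable, the adjacency list and the centroid-distance merge criterion are all hypothetical simplifications, not the authors' algorithm.

```python
def constrained_clustering(values, adjacency, n_clusters):
    """Agglomerative clustering in which only adjacent clusters may merge.

    values:     dict mapping region name -> one numeric variable (hypothetical)
    adjacency:  iterable of (region, region) pairs that share a border
    n_clusters: number of classes to stop at
    """
    clusters = {name: [name] for name in values}
    links = {frozenset(pair) for pair in adjacency}

    def centroid(cid):
        members = clusters[cid]
        return sum(values[m] for m in members) / len(members)

    while len(clusters) > n_clusters and links:
        # Sort the candidate pairs so that ties are broken deterministically.
        candidates = sorted(tuple(sorted(link)) for link in links)
        keep, absorb = min(
            candidates, key=lambda p: abs(centroid(p[0]) - centroid(p[1]))
        )
        clusters[keep].extend(clusters.pop(absorb))
        # Redirect the absorbed cluster's links to the surviving cluster.
        relabelled = {
            frozenset(keep if c == absorb else c for c in link) for link in links
        }
        links = {link for link in relabelled if len(link) == 2}

    return sorted(sorted(members) for members in clusters.values())


# Made-up data: five regions in a chain A-B-C-D-E.
values = {"A": 1.0, "B": 1.2, "C": 5.0, "D": 5.1, "E": 1.1}
adjacency = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
classes = constrained_clustering(values, adjacency, n_clusters=3)
print(classes)
```

In this toy data, region E resembles A and B but is not adjacent to them, so it stays in its own class. This is the kind of effect a contiguity constraint is meant to produce when classes should coincide with geographical regions.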
The longitudinal nature of patent value and technological usefulness: exploring PLS structural equation models
The purpose of this paper is to investigate the evolution of patent value and technological usefulness over time using longitudinal structural equation models. The variables are modeled as endogenous unobservable variables which depend on three exogenous constructs: the knowledge stock used by companies to create their inventions, the technological scope of the inventions and the international scope of protection. Two set-ups are explored. The first longitudinal model includes time-dependent manifest variables and the second includes time-dependent unobservable variables. The structural equation models are estimated using Partial Least Squares Path Modelling. We show that there is a trade-off between the exogenous latent variables and technological usefulness over time. This means that the former variables become less important and the latter more important as time passes. Preprint
PRESISTANT: Learning based assistant for data pre-processing
Data pre-processing is one of the most time-consuming and relevant steps in a
data analysis process (e.g., classification task). A given data pre-processing
operator (e.g., transformation) can have positive, negative or zero impact on
the final result of the analysis. Expert users have the required knowledge to
find the right pre-processing operators. However, when it comes to non-experts,
they are overwhelmed by the amount of pre-processing operators and it is
challenging for them to find operators that would positively impact their
analysis (e.g., increase the predictive accuracy of a classifier). Existing
solutions either assume that users have expert knowledge, or they recommend
pre-processing operators that are only "syntactically" applicable to a dataset,
without taking into account their impact on the final analysis. In this work,
we aim at providing assistance to non-expert users by recommending data
pre-processing operators that are ranked according to their impact on the final
analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the
impact of pre-processing operators on the performance (e.g., predictive
accuracy) of 5 different classification algorithms, such as J48, Naive Bayes,
PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations of the
recommendations provided by our tool show that PRESISTANT can effectively help
non-experts achieve improved results in their analytical tasks.
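The idea of ranking operators by their predicted impact can be sketched as follows. This is a minimal illustration, not the PRESISTANT implementation: a 1-nearest-neighbour lookup stands in for the Random Forests named in the abstract, and the meta-features, operator names and accuracy changes in `HISTORY` are invented for the example.

```python
from math import dist

# Hypothetical meta-dataset: for past datasets, described by two made-up
# meta-features, the observed change in accuracy after applying an operator.
HISTORY = [
    ((0.9, 0.1), "discretize", 0.04),
    ((0.9, 0.1), "normalize", -0.01),
    ((0.2, 0.8), "discretize", -0.03),
    ((0.2, 0.8), "normalize", 0.05),
]


def predict_impact(meta_features, operator, history=HISTORY):
    """Predict the accuracy change from the most similar past dataset (1-NN)."""
    candidates = [
        (dist(meta_features, past_features), delta)
        for past_features, past_operator, delta in history
        if past_operator == operator
    ]
    return min(candidates)[1]


def rank_operators(meta_features, operators, history=HISTORY):
    """Return (operator, predicted impact) pairs, most beneficial first."""
    scored = [(op, predict_impact(meta_features, op, history)) for op in operators]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_operators((0.8, 0.2), ["normalize", "discretize"])
print(ranking)
```

A learned regressor trained on many datasets would replace `predict_impact`. The ranking step, sorting operators by predicted change in the performance measure, stays the same.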
Professional profile of the computer engineer: a competency-based diagnosis
Universities must train the engineers that society needs. Curricula in the European Higher Education Area (EHEA) must therefore be designed on the basis of the professional competencies that society requires. Each school, however, has its own idiosyncrasies, and must choose the competencies its graduates will possess on completing their studies and design its curriculum around those competencies. The selection of competencies will define the professional profile of its graduates, so objective evidence is needed to make this selection properly.
This article presents the results of surveys conducted among several hundred professionals and a group of students and faculty at the Facultat d'Informàtica de Barcelona. The surveys show the degree of importance that professionals assign to each competency, and therefore define a professional profile. They also show how faculty and students perceive the learning of these competencies. Peer Reviewed
Design of the sampling plan for estimating the residual waste fraction in the typical refuse bag of Catalonia
Final report of PHASE 1 of the minor service contract awarded by Barcelona Ecologia to the Universitat Politècnica de Catalunya. Preprint
Intelligent assistance for data pre-processing
A data mining algorithm may perform differently on datasets with different characteristics, e.g., it might perform better on a dataset with continuous attributes than on one with categorical attributes, or the other way around. Typically, a dataset needs to be pre-processed before being mined. Taking into account all the possible pre-processing operators, there is a staggeringly large number of alternatives. As a consequence, inexperienced users become overwhelmed by pre-processing alternatives. In this paper, we show that the problem can be addressed by automating the pre-processing with the support of meta-learning. To this end, we analyzed a wide range of data pre-processing techniques and a set of classification algorithms. For each classification algorithm that we consider and a given dataset, we are able to automatically suggest the transformations that improve the quality of the results of the algorithm on the dataset. Our approach will help non-expert users to more effectively identify the transformations appropriate to their applications, and hence to achieve improved results. Postprint (author's final draft)
On the effect of measurement model misspecification in PLS Path Modeling: the reflective case
The specification of a measurement model as reflective or formative is the object of a lively
debate. Part of the existing literature focuses on measurement model misspecification. This
means that a true model is assumed and the impact on the path coefficients of using a wrong
model is investigated. The majority of these studies is restricted to Structural Equation
Modeling (SEM). Regarding PLS-Path Modeling (PLS-PM), a few authors have carried out
simulation studies to investigate the robustness of the estimates, but their focus is the
comparison with SEM. The present paper discusses the misspecification problem in the PLS-PM
context from a novel perspective. First, a real application on Alumni Satisfaction will be
used to verify whether different assumptions for the measurement models influence the
results. Second, the results of a Monte-Carlo simulation study, in the reflective case, will help
to bring some clarity to a complex problem that has not yet been sufficiently studied.
On the predictive power of meta-features in OpenML
The demand for performing data analysis is steadily rising. As a consequence, people of different profiles (i.e., non-experienced users) have started to analyze their data. However, this is challenging for them. A key step that poses difficulties and determines the success of the analysis is data mining (model/algorithm selection problem). Meta-learning is a technique used for assisting non-expert users in this step. The effectiveness of meta-learning is, however, largely dependent on the description/characterization of datasets (i.e., meta-features used for meta-learning). There is a need for improving the effectiveness of meta-learning by identifying and designing more predictive meta-features. In this work, we use a method from exploratory factor analysis to study the predictive power of different meta-features collected in OpenML, which is a collaborative machine learning platform that is designed to store and organize meta-data about datasets, data mining algorithms, models and their evaluations. We first use the method to extract latent features, which are abstract concepts that group together meta-features with common characteristics. Then, we study and visualize the relationship of the latent features with three different performance measures of four classification algorithms on hundreds of datasets available in OpenML, and we select the latent features with the highest predictive power. Finally, we use the selected latent features to perform meta-learning and we show that our method improves the meta-learning process. Furthermore, we design an easy-to-use application for retrieving different meta-data from OpenML as the biggest source of data in this domain. Peer Reviewed. Postprint (published version)
Modelling with heterogeneity
We present in this paper a methodology to deal with heterogeneity in modelling when the sources are unknown. Although the approach is general, we present it for PLS-PM latent variable modelling. We call this approach PATHMOX. The idea behind PATHMOX is to build a tree of path models with a structure resembling a binary decision tree, with models for different segments in each of its nodes. The split criterion is an F statistic for comparing structural models, based on testing the equality of the path coefficients. We emphasize the rationale of this approach and its limitations. Finally, we present an application to an Alumni Satisfaction survey. Peer Reviewed. Postprint (published version)
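To give a feel for an F-based split criterion, the sketch below computes a Chow-type F statistic that tests whether two segments share the same coefficients. It is a deliberately simplified analogue: a single regression with one predictor stands in for the structural model, so it is not the exact PATHMOX statistic, and the segment data are invented.

```python
def ols_ssr(x, y):
    """Sum of squared residuals of a simple least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def chow_f(xa, ya, xb, yb, k=2):
    """F statistic for H0: both segments share the same k coefficients."""
    ssr_pooled = ols_ssr(xa + xb, ya + yb)           # one model for everyone
    ssr_split = ols_ssr(xa, ya) + ols_ssr(xb, yb)    # one model per segment
    df_denominator = len(xa) + len(xb) - 2 * k
    return ((ssr_pooled - ssr_split) / k) / (ssr_split / df_denominator)


# Made-up segments: the first pair has opposite slopes, the second similar ones.
x = [1.0, 2.0, 3.0, 4.0]
f_different = chow_f(x, [1.1, 1.9, 3.2, 3.8], x, [4.1, 2.9, 2.1, 0.9])
f_similar = chow_f(x, [1.1, 1.9, 3.2, 3.8], x, [1.0, 2.1, 2.9, 4.0])
print(f_different, f_similar)
```

A tree-building procedure in this spirit would evaluate such a statistic for every candidate binary split of the respondents and keep the split with the largest, most significant value.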